inbound: Improve policy metrics #1237

olix0r · 2021-09-02T20:57:02Z

We recently introduced metrics to help surface inbound policy decisions,
but in practice these haven't been as useful as we might hope.
Specifically, error metrics do not include the target_addr label so
these metrics can't be correlated with servers, etc. This change
improves error metrics and also introduces new metrics to describe
authorization decisions: authorization denials shouldn't be classified
as errors, really, anyway.

This change also improves TCP forwarding authorization so that policy
changes can be honored at runtime: previously authorized connections may
dropped if the policy is updated so that the connection is no longer
authorized.

The gateway is also updated to enforce HTTP policies at runtime as well
so that policy changes can be honored after the connection has been
established.

This change introduces new metrics:

inbound_http_authz_allow_total
inbound_http_authz_deny_total
inbound_tcp_authz_allow_total
inbound_tcp_authz_deny_total
inbound_tcp_authz_terminate_total

allow metrics include target_addr, srv_name, and saz_name
labels. deny and terminate metics include only target_addr and
srv_name labels.

Authorization denials are no longer reflected in inbound_tcp_error or
inbound_http_error metrics.

A number of internal changes have been made to support this:

The inbound::policy::authorize module includes middlewares for TCP
and HTTP authorization, replacing the prior method of enforcing policy
in the stack/router. This module ensures that metrics are recorded for
policy decisions.
The error-metrics crate has been removed. In its place a monitor
type has been added to the stack crate, supporting a general way to
observe errors, decoupled from the metrics registry.
Inbound and outbound error metrics are now tracked in the inbound and
outbound crates, respectively. Inbound- and outbound-specific error
types are also moved into their rspective crates.
The app_core::errors module has been updated to only define the
types it needs to instrument the error response layer. Error responses
are now primarily instrumented via the HttpError type so that errors
that should be handled can be configured where the error is thrown.
The error type now holds an underlying source error so that the error
metrics layer can see through this wrapper type to track the
underlying error cause.
Server & Authorization labels are no longer handled as a free-form
maps. We currently read only the name label from each; and this
label is required.

…c error types

…rticular)

Now the `name` label is required on both servers and authorizations. Other labels are ignored in metrics.

We recently introduced metrics to help surface inbound policy decisions, but in practice these haven't been as useful as we might hope. Specifically, error metrics do not include the `target_addr` label so these metrics can't be correlated with servers, etc. This change improves error metrics and also introduces new metrics to describe authorization decisions: authorization denials shouldn't be classified as errors, really, anyway. This change also improves TCP forwarding authorization so that policy changes can be honored at runtime: previously authorized connections may dropped if the policy is updated so that the connection is no longer authorized. The gateway is also updated to enforce HTTP policies at runtime as well so that policy changes can be honored after the connection has been established. This change introduces new metrics: * `inbound_http_authz_allow_total` * `inbound_http_authz_deny_total` * `inbound_tcp_authz_allow_total` * `inbound_tcp_authz_deny_total` * `inbound_tcp_authz_terminate_total` _allow_ metrics include `target_addr`, `srv_name`, and `saz_name` labels. _deny_ and _terminate_ metics include only `target_addr` and `srv_name` labels. Authorization denials are no longer reflected in inbound_tcp_error or inbound_http_error metrics. A number of internal changes have been made to support this: * The `inbound::policy::authorize` module includes middlewares for TCP and HTTP authorization, replacing the prior method of enforcing policy in the stack/router. This module ensures that metrics are recorded for policy decisions. * The `error-metrics` crate has been removed. In its place a `monitor` type has been added to the `stack` crate, supporting a general way to observe errors, decoupled from the metrics registry. * Inbound and outbound error metrics are now tracked in the inbound and outbound crates, respectively. Inbound- and outbound-specific error types are also moved into their rspective crates. * The `app_core::errors` module has been updated to only define the types it needs to instrument the error response layer. Error responses are now primarily instrumented via the `HttpError` type so that errors that should be handled can be configured where the error is thrown. The error type now holds an underlying source error so that the error metrics layer can see through this wrapper type to track the underlying error cause. * Server & Authorization labels are no longer handled as a free-form maps. We currently read only the `name` label from each; and this label is required.

hawkw

overall, this looks good to me, i had mostly minor comments. i do think we may want to just replace the RwLocks with Mutexes, because it seems like we're not really getting any benefit from them --- we never have concurrent reads, and it might just be adding overhead.

hawkw · 2021-09-03T00:04:18Z

linkerd/app/core/src/errors/respond.rs

+                // should not be calling poll_data if we have set trailers derived from an error
+                assert!(trailers.is_none());


nit/TIOLI: should we maybe just move the comment into the panic message for the assert?

linkerd/app/inbound/src/detect.rs

hawkw · 2021-09-03T00:22:51Z

linkerd/app/inbound/src/http/router.rs

@@ -355,6 +345,7 @@ impl tap::Inspect for Logical {
    }

    fn dst_labels<B>(&self, _: &http::Request<B>) -> Option<&tap::Labels> {
+        // TODO include policy labels here.


does this need to be addressed as part of this PR, or is it a follow-up change?

there's a followup change already up

linkerd/app/inbound/src/lib.rs

linkerd/app/inbound/src/metrics/authz.rs

linkerd/app/inbound/src/metrics/error/tcp.rs

linkerd/app/inbound/src/policy/authorize/tcp.rs

linkerd/app/inbound/src/policy/mod.rs

linkerd/app/outbound/src/metrics/error/http.rs

linkerd/stack/src/monitor.rs

This release features some performance improvements: tokio has been updated to pick up tokio-rs/tokio#4055, and link-time optimizations have been enabled in release builds. These changes reduce CPU and memory overhead in benchmarks. Inbound policy enforcement has been updated so that TCP forwarding is interrupted if a policy update revokes a previously-established authorization. New metrics are exposed to reflect how policies are used by the proxy: `inbound_http_authz_{allow,deny}_total` and `inbound_tcp_authz_{allow,deny,terminate}_total`. The proxy's error metrics, `{inbound,outbound}_{http,tcp}_errors_total`, have been updated to include the traffic target. And the `traffic_addr` metric label is augmented by `target_ip` and `target_port` labels to support more flexible prometheus queries. Inbound TCP metrics now only include a `srv_name` label, as it can't be expected for all inbound connections to include authorization labels (hence the new authz metrics). However, all inbound HTTP metrics--except for the HTTP errors metric, which includes only a `srv_name` label--include both `srv_name` and `saz_name` label. Finally, the inbound and outbound proxies now only exports Route-oriented metrics when a ServiceProfile is enabled, preventing redundant metrics from being exported with no differentiating labels. --- * profiles: Avoid creating a default route stack (linkerd/linkerd2-proxy#1223) * build(deps): bump arbitrary from 1.0.1 to 1.0.2 (linkerd/linkerd2-proxy#1224) * build(deps): bump trust-dns-resolver from `f08860c` to `3d0667a` (linkerd/linkerd2-proxy#1225) * build(deps): bump libc from 0.2.100 to 0.2.101 (linkerd/linkerd2-proxy#1226) * Enable link-time optimizations (linkerd/linkerd2-proxy#1227) * build(deps): bump serde_json from 1.0.66 to 1.0.67 (linkerd/linkerd2-proxy#1228) * build(deps): bump flate2 from 1.0.20 to 1.0.21 (linkerd/linkerd2-proxy#1230) * build(deps): bump thiserror from 1.0.26 to 1.0.28 (linkerd/linkerd2-proxy#1231) * build(deps): bump futures from 0.3.16 to 0.3.17 (linkerd/linkerd2-proxy#1232) * build(deps): bump parking_lot from 0.11.1 to 0.11.2 (linkerd/linkerd2-proxy#1234) * build(deps): bump trust-dns-resolver from `3d0667a` to `v0.21.0-alpha.2` (linkerd/linkerd2-proxy#1233) * Rename push_on_response to push_on_service (linkerd/linkerd2-proxy#1235) * build(deps): bump tokio from 1.10.1 to 1.11.0 (linkerd/linkerd2-proxy#1236) * metrics: Add `target_ip` and `target_port` labels (linkerd/linkerd2-proxy#1238) * inbound: Improve policy metrics (linkerd/linkerd2-proxy#1237) * inbound: Include server labels in tap responses (linkerd/linkerd2-proxy#1239) * Revert rustc update for release builds

olix0r added 30 commits August 30, 2021 19:58

Split error metrics tracking from stack layer

4695f77

Move deny errors into errors mod

8b800ab

Ensure connect timeouts are annotated properly; remove dispatch timeout

fcf47d6

wip: start moving inbound metrics into inbound

6fc5ea1

wip: Move outbound metrics into outbound

67bdba9

Restore HttpError type to decouple error responder from proxy-specifi…

56315f8

…c error types

Move errors into their respective crates

4a6b3bd

comments

10c8a8b

Merge branch 'main' into ver/monitor

ded6fe4

fmt

13981e5

Include target_addr in inbound HTTP errors

adb07b6

fix gateway error tests

784b9b1

Merge branch 'main' into ver/monitor

a4719e9

wip

8943d00

wip: separate http authorization layer

30f06c7

Separate TCP authorization layer

d8b20f6

wip

53f41ed

Always permit connections from localhost (for the admin server, in pa…

e38cb78

…rticular)

back-out admin policy

32dddfe

lint

3e4591a

wire up authz metrics with minimal labeling

185ab97

cleanup policy labels

976f4b6

fix telemetry tests

b7ab6e7

Stop handling policy labels as free-form maps.

1a08b08

Now the `name` label is required on both servers and authorizations. Other labels are ignored in metrics.

fixup error metric labels

8fea62d

backout loopback authorization changes

0b48476

metrics initialization touchup

a883aca

Merge branch 'main' into ver/monitor

a881656

report authz metrics

238d77b

Set default labels on the admin server

e745ca6

olix0r added 6 commits September 2, 2021 19:42

admin stack doesn't need inbound metrics handle

e014101

back-out unneeded runtime changes

5c58be6

smaller gateway change

44a419c

+comments

8ba1a08

back out env changes

662b6e7

olix0r requested a review from a team September 2, 2021 20:57

olix0r added 6 commits September 2, 2021 21:13

fixup fuzzer

9c025b5

Restore http error metrics on the admin server

93907af

fix fuzzer

230b707

Merge branch 'main' into ver/monitor

8773eba

indent

4c8633c

-comment

9b4020b

hawkw approved these changes Sep 3, 2021

View reviewed changes

olix0r added 6 commits September 3, 2021 01:14

dedupe forward stack

bf87bc5

fixup

e24fefa

We don't really gain anything from using a RwLock

dd50466

fix tcp metrics comments

632c1b8

Add a comment for AllowPolicy::changed

f390a29

fixup fuzzer

e25bf93

olix0r merged commit 29c22af into main Sep 3, 2021

olix0r deleted the ver/monitor branch September 3, 2021 02:26

olix0r mentioned this pull request Sep 3, 2021

proxy: v2.155.0 linkerd/linkerd2#6802

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

inbound: Improve policy metrics #1237

inbound: Improve policy metrics #1237

olix0r commented Sep 2, 2021

hawkw left a comment

hawkw Sep 3, 2021

hawkw Sep 3, 2021

olix0r Sep 3, 2021

		// should not be calling poll_data if we have set trailers derived from an error
		assert!(trailers.is_none());

inbound: Improve policy metrics #1237

inbound: Improve policy metrics #1237

Conversation

olix0r commented Sep 2, 2021

hawkw left a comment

Choose a reason for hiding this comment

hawkw Sep 3, 2021

Choose a reason for hiding this comment

hawkw Sep 3, 2021

Choose a reason for hiding this comment

olix0r Sep 3, 2021

Choose a reason for hiding this comment